Regression Notes - 1

This is the first notebook covering regression topics. It starts with basic estimation and diagnostics. This example uses a dataset I’m familiar with through work experience, but it isn’t ideal for demonstrating more advanced topics.

The sample data I’m using for this is 3_3_Solar_Y2019.xlsx from form EIA-860, available here. The data contains generator-level solar energy data reported by utilities to the Energy Information Administration for calendar year 2019. For reasons that will shortly become clear, I am only interested in photovoltaic (PV) generators.

With respect to PV generators, this is a proxy for the size of the installation (bear in mind that a utility-owned PV generator as reported in the data often consists of multiple panels in a “solar farm”, but the panels in an installation should all have the same tilt). I’ll be regressing this capacity data on the panel tilt angle to see if there is any relationship while also controlling for a few other factors that could also affect installation capacity.

First, I’ll import the data and do some basic visualizations on generation capacity (the amount of power in megawatts that can be output by the generator) and compare it to that of thermal solar installations.

#import libraries and import data as dataframe
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_excel (r'C:\Users\garre\Documents\Python Scripts\datasets\EIA-860\3_3_Solar_Y2019.xlsx', sheet_name = 'Operable', header=1)



data.boxplot(column = 'Nameplate Capacity (MW)', by = 'Prime Mover', figsize = (4,8) )

<AxesSubplot:title={'center':'Nameplate Capacity (MW)'}, xlabel='Prime Mover'>

png

groupedbypm = data.groupby(by = data['Prime Mover']).sum()
plt.pie(x = groupedbypm['Nameplate Capacity (MW)'], labels = groupedbypm.index )
([<matplotlib.patches.Wedge at 0x1ee05b98b80>,
  <matplotlib.patches.Wedge at 0x1ee05b9e0a0>,
  <matplotlib.patches.Wedge at 0x1ee05b9e520>],
 [Text(1.099376518408859, 0.03703067338323363, 'CP'),
  Text(-1.096464773848027, 0.08811923575698509, 'PV'),
  Text(1.0928768186678455, -0.12498103543517913, 'ST')])

png

While they tend to be much larger than photovoltaic (PV) generators, solar thermal plants (CP and ST) are so few in number that they comprise only a small fraction of utility-owned solar generation capacity.

I’m interested in the relationship between PV panel tilt angle and the generator’s inverter loading ratio (the ratio of DC capacity to AC (nameplate) capacity).

First, I need to do some cleaning. I want to create a dataframe that only contains PV solar installations whose tilt angle was reported. I also need to refactor the tilt angle variable from object to float. Finally, I need to create a variable for the ILR.

There are some datapoints with unrealistically high ILR values. These are assumed to be due to reporting errors. Conservatively, we will drop observations with ILR values higher than two.

#create dataframe limited to photovoltaic generators whose tilt angle was reported
pvdata = data[(data['Prime Mover'] == 'PV') & (data['Tilt Angle']!=" ")].copy()
pvdata['Tilt Angle'] = pd.to_numeric(pvdata['Tilt Angle'])
#calculate Inverter Loading Ratio
pvdata['ILR'] = pvdata['DC Net Capacity (MW)']/pvdata['Nameplate Capacity (MW)']
pvdata['ILR'] = pd.to_numeric(pvdata['ILR'])
pvdata = pvdata.loc[pvdata['ILR']<=2]
pvdata['ILR'].plot.box(figsize = (4,8) )
<AxesSubplot:>

png

#scatter plot
plt.plot(pvdata['Tilt Angle'], pvdata['ILR'], 'o')
[<matplotlib.lines.Line2D at 0x1ee0874fa60>]

png

#obtain pearson correlation coefficient
from scipy.stats import pearsonr
pearsonr(pvdata['Tilt Angle'], pvdata['ILR'])

$\displaystyle \left( 0.08111342090515408, \ 2.274360325748276e-06\right)$

pearsonr(pvdata['Tilt Angle'], pvdata['Operating Year'])

$\displaystyle \left( 0.17748663027514733, \ 2.23648978601648e-25\right)$

The first value in the tuple returned by the function is the coefficient. The second value is a two-tailed p-value. The variables don’t appear to be strongly correlated. Controlling for installation age might reveal a relationship, so let’s continue anyway. Let’s regress ILR on tilt angle, the year it was built.

import statsmodels.formula.api as sm
import statsmodels.api as sma
model = sm.ols(formula = "pvdata['ILR']~pvdata['Tilt Angle']+pvdata['Operating Year']", data = pvdata).fit(cov_type='HC1')
print(str(model.summary()))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          pvdata['ILR']   R-squared:                       0.055
Model:                            OLS   Adj. R-squared:                  0.054
Method:                 Least Squares   F-statistic:                     93.10
Date:                Thu, 05 Nov 2020   Prob (F-statistic):           4.35e-40
Time:                        17:40:04   Log-Likelihood:                 1566.2
No. Observations:                3388   AIC:                            -3126.
Df Residuals:                    3385   BIC:                            -3108.
Df Model:                           2                                         
Covariance Type:                  HC1                                         
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  -24.2847      2.010    -12.079      0.000     -28.225     -20.344
pvdata['Tilt Angle']         0.0005      0.000      2.618      0.009       0.000       0.001
pvdata['Operating Year']     0.0127      0.001     12.687      0.000       0.011       0.015
==============================================================================
Omnibus:                       51.091   Durbin-Watson:                   1.203
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               79.362
Skew:                          -0.145   Prob(JB):                     5.84e-18
Kurtosis:                       3.692   Cond. No.                     1.50e+06
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)
[2] The condition number is large, 1.5e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
fitted = pd.DataFrame(model.fittedvalues, columns = ['Fitted'])
resid = pd.DataFrame(model.resid, columns = ['Resid'])
pvdata['yhat'] = fitted['Fitted']
pvdata['resid'] = resid['Resid']
pvdata.to_excel(r'C:\Users\garre\Documents\Python Scripts\datasets\EIA-860\post_reg_data.xlsx')
from statsmodels.stats.stattools import jarque_bera
from sympy import *
init_printing(use_unicode=True)
name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
normaltest = jarque_bera(model.resid)
x = (name, normaltest)
for line in x:
    print(*line)
Jarque-Bera Chi^2 two-tail prob. Skew Kurtosis
79.36218170079768 5.844148028089456e-18 -0.1446461440730232 3.6917353539381383

Jarque-Bera test results are reassuring, with skewness extremely close to 0, and kurtosis not far off from 3.

sma.graphics.plot_partregress_grid(model).tight_layout(pad=1.0)

png

sma.graphics.plot_regress_exog(model,"pvdata['Tilt Angle']" ).tight_layout(pad=1.0)

png

sma.graphics.plot_regress_exog(model,"pvdata['Operating Year']" ).tight_layout(pad=1.0)

png

plt.plot(model.fittedvalues,model.resid,   'o')
[<matplotlib.lines.Line2D at 0x1ee0db36a00>]

png

This is bizarre: A subset of the residuals form a perfect line.

plt.plot(model.fittedvalues[model.resid<0], model.resid[model.resid<0],   'o')
[<matplotlib.lines.Line2D at 0x1ee091bc580>]

png

Further analysis of the data shows that these residuals correspond to an ILR value of 1:

plt.plot(pvdata['yhat'][pvdata['ILR']==1], pvdata['resid'][pvdata['ILR']==1],  'o')
[<matplotlib.lines.Line2D at 0x1ee05b68ee0>]

png

pvdata['Tilt Angle log'] = np.log(pvdata['Tilt Angle']+1)
pvdata['ILR log'] = np.log(pvdata['ILR'])
linearlogmodel = sm.ols(formula = "pvdata['ILR']~pvdata['Tilt Angle log']+pvdata['Operating Year']+pvdata['Crystalline Silicon?']", data = pvdata).fit()
loglinearmodel = sm.ols(formula = "pvdata['ILR log']~pvdata['Tilt Angle']+pvdata['Operating Year']+pvdata['Crystalline Silicon?']", data = pvdata).fit()
loglogmodel =  sm.ols(formula = "pvdata['ILR log']~pvdata['Tilt Angle log']+pvdata['Operating Year']+pvdata['Crystalline Silicon?']", data = pvdata).fit()
plt.plot(loglinearmodel.resid, loglinearmodel.fittedvalues,  'o')
plt.show()
plt.plot(linearlogmodel.resid, linearlogmodel.fittedvalues,  'o')
plt.show()
plt.plot(loglogmodel.resid, loglogmodel.fittedvalues,  'o')
plt.show()

png

png

png


2020

Regression Notes - 1

4 minute read

This is the first notebook covering regression topics. It starts with basic estimation and diagnostics. This example uses a dataset I’m familiar with throu...

Lesson 1-3-1: Conditional Probability

less than 1 minute read

Now things are getting fun: We’re moving on to section 3, on conditional probability and independence! If you’re on mobile, you can view the lesson here: Pr...

Lesson R-1-2-6: Counting

less than 1 minute read

This is the coding supplement to theory lesson 1-2-6. If you’re on mobile, view this lesson here: Lesson R-1-2-6: Counting. You can also download the code ...

Lesson 1-2-7: Finishing Up

less than 1 minute read

Short final lesson on calculating probabilities by counting. If you’re on mobile, you can view the lesson here: Presentation 1-2-7: Finishing Up

Lesson 1-2-6: Learning to Count, Part 2

less than 1 minute read

And now for the thrilling conclusion to our two-part lesson on counting! If you’re on mobile, you can view the lesson here: Presentation 1-2-6 Learning to Co...

Lesson 1-2-5: Learning to Count, Part 1

less than 1 minute read

Think you know how to count? Well, think again! It’s time we covered the Fundamental Theorem of Counting! This two-part lesson will be followed up by a cod...

Lesson R-0-1: Prologue: Obtaining R

less than 1 minute read

Pretty soon we’re gonna need to write some code. Luckily, I’m here to teach you how to do it! We’re gonna start by installing R and RStudio. Then, we’ll g...

Lesson 1-2-3: Axioms and Consequences

less than 1 minute read

Now it’s getting serious.  We’ve established the axioms of probability, but there is so much more to be done with them!  Some of it, sadly, gets a little inv...

Lesson 1-2-2: The Axioms, Explained

less than 1 minute read

Now we’re getting to the good stuff!  We’re not only taking a deeper dive into the axioms of probability, we’re also proving that a (relatively) simple funct...

Lesson 1-2-1: Of Axioms and Algebras

less than 1 minute read

We’re ready to start working toward an axiomatic definition of probability.  First, however, we take a rather painful detour through the realm of Borel field...

Lesson 1-1-3: Set Theory Conclusion

less than 1 minute read

This lesson rounds out our primer on set theory by introducing operators that allow us to take the union or intersection of many (or even infinitely many) se...

Lesson 1-1-2: Event Operations and Properties

less than 1 minute read

This lesson explores event operations, specifically unions, intersections, and complementation, as well as the properties of these operations with detailed v...

Lesson 1-1-1: The Sample Space and Events

less than 1 minute read

This lesson begins our journey with the introduction of the sample space, and the events therein.  If you’re on mobile, you can view the lesson here: Present...

Back to top ↑